Project 3¶
Author: Maja Noack Date: 2025-11-17
Part 1: Lightweight Object Detection¶
Implement and train a lightweight detection model (SSD, Faster R-CNN with MobileNet, or YOLO). Object detection datasets are typically larger than toy datasets like MNIST used in image classification. Here, a banana detection dataset can be used: the authors took photos of bananas and generated 1000 banana images with different rotations and sizes.
The banana detection dataset was downloaded as described and imported via the provided code.
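The label import can be sketched as follows. This assumes the d2l-style layout of the banana dataset's label file (columns `img_name, label, xmin, ymin, xmax, ymax` with pixel coordinates normalized by the image size); the actual notebook code may differ:

```python
import csv
import io

def parse_banana_labels(csv_text, img_size=256):
    """Parse banana-detection labels into (image_name, class, box) records.

    Assumes d2l-style columns: img_name, label, xmin, ymin, xmax, ymax,
    with pixel coordinates normalized to [0, 1] by the image size.
    """
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        box = [float(row[k]) / img_size for k in ("xmin", "ymin", "xmax", "ymax")]
        records.append((row["img_name"], int(row["label"]), box))
    return records

# Two hypothetical rows in the assumed format:
sample = """img_name,label,xmin,ymin,xmax,ymax
0.png,0,104,20,143,58
1.png,0,68,175,118,223
"""
recs = parse_banana_labels(sample)
print(len(recs), recs[0][0])  # → 2 0.png
```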
The SSD model was then built and trained on this dataset.
Discussion:¶
The loss-curve plot shows that both the bounding-box and class errors sharply decrease to under 0.005 after the first couple of training epochs. This indicates that the model quickly learns to localize and classify bananas with high precision. The nearly identical trends of the two curves further suggest that the offset-regression and classification branches learn at similar rates. As visible in the validation images of the banana dataset, the model localizes the bananas with high confidence (over 0.9 for most images). However, for the images I took myself in a real-world setting, the model is unable to identify the bananas even when the confidence threshold is decreased drastically (to 0.3). This suggests that the model has overfit the dataset and learned to identify only the artificially placed bananas. As the model has not been trained on any real-life images of bananas, which can vary in size, shape, and angle much more dramatically than the bananas in the training images, it struggles to detect them. Even though the first image I included shows a banana photographed from a very similar angle and at a similar size to the dataset bananas, the model is still unable to detect it. Notably, the dataset bananas have a solid dark border around them and look like cut-out bitmap graphics pasted onto the image. These strong edges provide highly discriminative features that the CNN can easily pick up, unlike the blended shapes of the bananas in my own images. This could explain the difficulty the model has detecting real-world bananas.
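The confidence-threshold filtering mentioned above (0.9 on the dataset images, 0.3 on the real-world photos) can be sketched as follows; the `(class_id, confidence, box)` tuple format is an assumption, not the notebook's exact output format:

```python
def filter_by_confidence(detections, threshold=0.7):
    """Keep only detections whose confidence reaches the threshold.

    `detections` is a hypothetical list of (class_id, confidence, box)
    tuples in [x1, y1, x2, y2] pixel coordinates.
    """
    return [d for d in detections if d[1] >= threshold]

# Lowering the threshold admits weaker detections, which is why the
# real-world banana images were re-checked at 0.3:
dets = [
    (0, 0.95, [10, 10, 60, 60]),
    (0, 0.45, [12, 11, 62, 58]),
    (0, 0.10, [80, 80, 120, 120]),
]
print(len(filter_by_confidence(dets, 0.7)))  # → 1
print(len(filter_by_confidence(dets, 0.3)))  # → 2
```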
Part 2: Non-Maximum Suppression (NMS)¶
Before Non-Maximum Suppression:
After Non-Maximum Suppression:
Discussion: Compare with PyTorch’s NMS implementation. Any difference? Discuss its purpose and limitations.¶
Non-Maximum Suppression (NMS) is used in object detection to eliminate duplicate bounding boxes that identify the same object. Modern detectors generate numerous overlapping boxes with similar confidence levels, so NMS selects the most confident box while removing the others. As visible in the results without NMS, even if the confidence threshold is set reasonably high (0.7), many bounding boxes overlap each other. By introducing NMS, the number of bounding boxes per object can often be reduced to one, as visible in both my and PyTorch's implementations. Both NMS implementations produce the same type of suppression behavior, but they differ in execution speed. My custom NMS is a clear and readable Python implementation; however, it is relatively slow due to its Python loops, and less numerically stable because IoU and suppression decisions depend directly on floating-point operations in Python. PyTorch's version is a highly optimized C++/CUDA implementation that needs only about half as much time to compute. While both methods give similar qualitative outputs, the PyTorch version is more efficient.
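A minimal pure-Python version of the greedy NMS described above looks as follows; the `[x1, y1, x2, y2]` box format is assumed, and the notebook's actual implementation may differ in detail:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop all
    remaining boxes whose IoU with it exceeds the threshold.
    Returns the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Boxes 0 and 1 overlap heavily (IoU = 0.81), so only the higher-scoring
# one survives; box 2 is disjoint and is kept.
boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, 0.5))  # → [0, 2]
```

`torchvision.ops.nms(boxes, scores, iou_threshold)` implements the same greedy rule in C++/CUDA, which explains the speed gap noted above.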
Part 3: Human–Object Interaction (HOI) Analysis using VLM¶
Perform zero-shot HOI analysis using VLMs on a subset of the HICO-DET dataset (zhimeng/hico_det on Hugging Face).
- Use one (or more) open-source or closed-source VLMs (e.g., Gemini, GPT, LLaVA, Qwen) to predict human–object interactions.
- Come up with your prompt to guide the VLMs to predict
Example image of the HICO-DET test dataset.
Qwen's HICO-DET implementation has been adapted from the original Qwen Hugging Face example. Six images from the HICO-DET test set are loaded, and the model is tasked to provide bounding boxes for the human and the object, and to label the interaction and the object in JSON format.
Discussion: Can you identify a few failure cases where VLMs fail to predict the HOI classes for the given images? If so, discuss the possible reasons.¶
The first prompt given to the model was the following:
" You are a Human-Object Interaction (HOI) detector. Detect all humans, objects and the action (verb) between them. For each HOI, return:
- verb (action)
- object
- bounding box for human [x1,y1,x2,y2]
- bounding box for object [x1,y1,x2,y2] Return ONLY valid JSON with the format: { "hois": [ {"verb": "...", "object": "...", "bbox_human": [...], "bbox_object": [...] }, ... ] } "
As visible in the example images, the predictions from the model are very unstable. Half of the predictions were not provided in complete JSON format and were therefore not accepted as predictions (no_prediction). Furthermore, the bounding boxes do not overlap the human and object properly, are sometimes not even inside the picture (image 1), or the object box and human box are predicted as identical, as visible in images 4 and 5. Instead of predicting the object label, the model labeled the human in the image (e.g. "man", "people"). There are several possible reasons for this behaviour. Due to the relatively small size of the model, it may struggle to follow complex structured prompts or maintain consistent JSON formatting across inputs. Smaller models also often have weaker grounding capabilities, which would explain the hallucinated and identical object and human boxes. The model could also lack sufficient capacity to disentangle human and object regions when multiple entities overlap. As VLMs are highly prompt-sensitive, even minor variations in phrasing or context can further destabilize predictions, leading to inconsistent HOI outputs.
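The no_prediction handling described above can be sketched as a tolerant JSON extractor that searches for a JSON object inside the raw model text before parsing; this is a post-processing sketch, not the notebook's exact parser:

```python
import json
import re

def extract_hois(model_output):
    """Recover the `hois` list from raw VLM text output, if possible.

    The model sometimes wraps JSON in prose or markdown fences, so we grab
    the outermost {...} span before parsing. Returns None ("no_prediction")
    when no valid JSON with a `hois` list is found.
    """
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    hois = data.get("hois")
    return hois if isinstance(hois, list) else None

# A fenced-but-valid reply is parsed; truncated output yields None.
reply = 'Sure:\n```json\n{"hois": [{"verb": "ride", "object": "horse"}]}\n```'
print(extract_hois(reply)[0]["verb"])        # → ride
print(extract_hois('{"hois": [{"verb"'))     # → None
```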
Give it a try to fix the failure cases via better prompts or few-shot examples (in-context learning). Discuss if your solution works or the failure cases still cannot be solved.¶
The prompt was updated to:
"You are a Human-Object Interaction (HOI) detector.
Detect all humans and objects in the image.
For each HOI, return:
- human (always use the string "person")
- verb (the action)
- object (the interacted object)
- bbox_human [x1, y1, x2, y2]
- bbox_object [x1, y1, x2, y2]
Return ONLY valid JSON in the following format:
{ "hois": [ { "human": "person", "verb": "...", "object": "...", "bbox_human": [x1, y1, x2, y2], "bbox_object": [x1, y1, x2, y2] }, ... ] }
Example (for illustration only):
{ "hois": [ { "human": "person", "verb": "ride", "object": "bicycle", "bbox_human": [120, 50, 300, 400], "bbox_object": [100, 200, 350, 450] } ] } "
Two major changes were made. The model was instructed to provide a separate, fixed key–value pair for the human in each detection, to avoid the model predicting the human as the object. Additionally, the prompt was extended into a few-shot setup by providing an example prediction. This improved the prediction quality of the model significantly. Now all 6 images have predictions, some of them even identical to the ground truth of the image (e.g. image 0: sit bench; image 1: ride horse, walk horse; image 3: sit motorcycle). However, although the model was able to identify many correct interactions, it was still unable to properly localize them, with bounding boxes still being randomly spread across and even outside the image, and some objects being predicted several times more often than they appeared in the image (e.g. image 1: horse; image 3: motorcycle). This shows that although prompt engineering was able to improve the model's prediction capabilities, it did not improve its grounding capabilities. This is likely due to its limited size and possibly missing fine-tuning for grounding.
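The localization failures described above (boxes outside the image, identical human and object boxes) can be flagged automatically. This is a rough diagnostic sketch assuming pixel boxes in `[x1, y1, x2, y2]` format, not HICO-DET's official matching protocol:

```python
def diagnose_hoi(hoi, width, height):
    """Flag grounding failure modes in one predicted HOI dict:
    degenerate or out-of-image boxes, and identical human/object boxes.
    Boxes are assumed to be [x1, y1, x2, y2] in pixels."""
    def invalid(box):
        x1, y1, x2, y2 = box
        return not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height)

    issues = []
    if invalid(hoi["bbox_human"]):
        issues.append("human_box_invalid")
    if invalid(hoi["bbox_object"]):
        issues.append("object_box_invalid")
    if hoi["bbox_human"] == hoi["bbox_object"]:
        issues.append("identical_boxes")
    return issues

# A prediction that reuses the same box for human and object (as in
# images 4 and 5 above) is caught:
pred = {"bbox_human": [120, 50, 300, 400], "bbox_object": [120, 50, 300, 400]}
print(diagnose_hoi(pred, 640, 480))  # → ['identical_boxes']
```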